Abstract. We introduce the first grammar-compressed representation of a sequence that supports
searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u]
that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N
(measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based
representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + εn lg n + o(N lg n)
bits of space, for any 0 < ε < 1. It can find the positions of the occ occurrences of a pattern of
length m in T in O((m^2/ε) lg((lg u)/(lg n)) + occ lg n) time, and extract any substring of length ℓ of T in time
O(ℓ + h lg(N/h)), where h is the height of the grammar tree.

1 Introduction and Related Work 

Grammar-based compression is an active area of research that dates from at least the seventies. A given
sequence T[1..u] over alphabet [1..σ] is replaced by a hopefully small (context-free) grammar G that generates
just the string T. Let n be the number of grammar symbols, counting terminals and nonterminals. Let N
be the size of the grammar, measured as the sum of the lengths of the right-hand sides of the rules. Then
the grammar-compressed representation of T requires N lg n bits, versus the u lg σ bits required by a plain
representation.

Grammar-based methods can achieve universal compression [21]. Unlike statistical methods, which exploit
frequencies to achieve compression, grammar-based methods exploit repetitions in the text, and thus they are
especially suitable for compressing highly repetitive sequence collections. These collections, containing long
identical substrings, possibly far away from each other, arise when managing software repositories, versioned
documents, temporal databases, transaction logs, periodic publications, and computational biology sequence
databases.

Finding the smallest grammar G* that represents a given text T is NP-complete [33,9]. Moreover, the
smallest grammar is never smaller than an LZ77 parse [35] of T. A simple method to achieve an O(lg u)-
approximation to the smallest grammar size is to parse T using LZ77 and then to convert it into a grammar
[33]. A more sophisticated approximation achieves ratio O(lg(u/N*)), where N* is the size of G*.

While grammar-compression methods are strictly inferior to LZ77 compression, and some popular
grammar-based compressors such as LZ78 [36], Re-Pair [24] and Sequitur [30] can generate sizes much larger than the
smallest grammar [9], some of those methods (in particular Re-Pair) perform very well in practice, both in
classical and repetitive settings.^1

In exchange, unlike LZ77, grammar compression allows one to decompress arbitrary substrings of T almost
optimally [16, 6]. The most recent result [6] extracts any T[p, p + ℓ − 1] in time O(ℓ + lg u). Unfortunately,
the representation that achieves this time complexity requires O(N lg u) bits, possibly proportional to but in
practice many times the size of the output of a grammar-based compressor. On the practical side, applications
like Comrad [23] achieve good space and time performance for extracting substrings of T.

More ambitious than just extracting arbitrary substrings from T is to ask for indexed searches, that is,
finding all the occ occurrences in T of a given pattern P[1..m]. Self-indexes are compressed text
representations that support both operations, extract and search, in time depending only polylogarithmically on u.

* This work was partially supported by Google U.S./Canada PhD Fellowship and David R. Cheriton Scholarships
program (first author), and by Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM
P05-001-F, Mideplan, Chile (second author).

^1 See the statistics in http://pizzachili.dcc.uchile.cl/repcorpus.html for a recent experiment.



They have appeared in the last decade [28], and have focused mostly on statistical compression. As a result,
they work well on classical texts, but not on repetitive collections [25]. Some of those self-indexes have been
adapted to repetitive collections [25], but they cannot reach the compression ratio of the best grammar-based
methods.

Searching for patterns on grammar-compressed text has been addressed mostly in sequential form [2], that
is, scanning the whole grammar. The best result [20] achieves time O(N + m^2 + occ). This may be o(u), but
it is still linear in the size of the compressed text. There exist a few self-indexes based on LZ78-like compression
[15,3,32], but LZ78 is among the weakest grammar-based compressors. In particular, LZ78 has been shown
not to be competitive on highly repetitive collections [25].

The only self-index supporting general grammar compressors [13] operates on "straight-line programs"
(SLPs), where the right-hand sides of the rules are of length 1 or 2. Given such a grammar they achieve, among
other tradeoffs, 3n lg n + n lg u bits of space and O(m(m + h) lg^2 n) search time, where h is the height of
the parse tree of the grammar. A general grammar of n symbols and size N can be converted into an SLP of
N − n rules.

More recently, a self-index based on LZ77 compression has been developed [22]. Given a parsing of T
into n phrases, the self-index uses n lg n + 2n lg u + O(n lg σ) bits of space, and searches in time O(m^2 h +
(m + occ) lg n), where h is the nesting of the parsing. Extraction requires O(ℓh) time. Experiments on
repetitive collections [11, 12] show that the grammar-based compressor [13] can be competitive with the best
classical self-index adapted to repetitive collections [25] but, at least in that particular implementation, is not
competitive with the LZ77-based self-index [22].

Note that the search time in both self-indexes depends on h. This is undesirable as h is only bounded by 
n. That kind of dependence has been removed for extracting text substrings [6], but not for searches. 

Our main contribution is a new representation of general context-free grammars. The following theorem 
summarizes its properties. Note that the search time is independent of h. 

Theorem 1. Let a sequence T[1..u] be represented by a context-free grammar with n symbols, size N and
height h. Then, for any 0 < ε < 1, there exists a data structure using at most 2N lg n + N lg u + εn lg n +
o(N lg n) bits of space that finds the occ occurrences of any pattern P[1..m] in T in time
O((m^2/ε) lg((lg u)/(lg n)) + occ lg n). It can extract any substring of length ℓ from T in time O(ℓ + h lg(N/h)).
The structure can be built in O(u + N lg N) time and O(u lg u) bits of working space.

In the rest of the paper we describe how this structure operates. First, we preprocess the grammar to 
enforce several invariants useful to ensure our time complexities. Then we use a data structure for labeled 
binary relations [13] to find the "primary" occurrences of P, that is, those formed when concatenating 
symbols in the right hand of a rule. To get rid of the factor h in this part of the search, we introduce a 
new technique to extract the first m symbols of the expansion of any nonterminal in time O(m). To find
the "secondary" occurrences (i.e., those that are found as the result of the nonterminal containing primary 
occurrences being mentioned elsewhere), we use a pruned representation of the parse tree of T. This tree is 
traversed upwards for each secondary occurrence to report. The grammar invariants introduced ensure that 
those traversals amortize to a constant number of steps per occurrence reported. In this way we get rid of 
the factor h on the secondary occurrences too. 

2 Basic Concepts 

2.1 Sequence Representations 

Our data structures use succinct representations of sequences. Given a sequence S of length N, drawn from 
an alphabet of size n, we need to support the following operations: 

— access(S, i): retrieves the symbol S[i].

— rank_a(S, i): number of occurrences of a in S[1..i].

— select_a(S, j): position where the jth a appears in S.

In the case where n = 2, Raman et al. [31] proposed two compressed representations of S that are
useful when the number n' of 1s in S is small (or large, which is not the case in this paper). One is
called a "fully indexable dictionary" (FID). It takes n' lg(N/n') + O(n' + N lg lg N / lg N) bits of space and
supports all the operations in constant time. A weaker one is an "indexable dictionary" (ID), which takes
n' lg(N/n') + O(n' + lg lg N) bits of space and supports in constant time queries access(S, i), rank(S, i) if
S[i] = 1, and select_1(S, j).

For general sequences, the wavelet tree [18] requires N lg n + o(N) bits of space [17] and supports all three
operations in O(lg n) time. Another representation, by Barbay et al. [4], requires at most N lg n + o(N lg n)
bits and solves access(S, i) in constant time and select(S, j) in time O(lg lg n), or vice versa. Query rank(S, i)
takes time O(lg lg n).
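
To make the semantics of the three operations concrete, the following is a minimal Python sketch that implements them naively, in linear time, over a plain string. Positions are 1-based as in the text. The function names and the sample string are illustrative assumptions; the succinct structures cited above answer the same queries within the stated space and time bounds, which this sketch does not attempt to match.

```python
def access(S, i):
    """Symbol S[i], with 1-based positions."""
    return S[i - 1]


def rank(S, a, i):
    """Number of occurrences of symbol a in S[1..i]."""
    return S[:i].count(a)


def select(S, a, j):
    """1-based position of the j-th occurrence of a in S, or None if absent."""
    seen = 0
    for pos, sym in enumerate(S, start=1):
        if sym == a:
            seen += 1
            if seen == j:
                return pos
    return None


S = "alabaralalabarda"  # sample sequence, used only for illustration
```

Note that rank and select are inverses in the sense that rank(S, a, select(S, a, j)) = j whenever the jth occurrence exists.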

2.2 Labeled Binary Relations 

A labeled binary relation corresponds to a binary relation R ⊆ A × B, where A = [1..n_1] and B = [1..n_2],
augmented with a function ℒ : A × B → L ∪ {⊥}, L = [1..ℓ], that defines labels for each pair in R, and ⊥
for pairs that are not in R. Let us identify A with the columns and B with the rows in a table. We describe
a simplification of a representation of binary relations [13, 14], for the case of this paper where each element
of A is associated to exactly one element of B, so |R| = n_1. We use a string S_B[1..n_1] over alphabet [1..n_2],
where S_B[i] is the element of B associated to column i. A second string S_L[1..n_1] on alphabet [1..ℓ] is stored,
so that S_L[i] is the label corresponding to the pair represented by S_B[i].

If we use a wavelet tree for S_B (see Section 2.1) and a plain string representation for S_L, the total space
is n_1(lg n_2 + lg ℓ) + O(n_1) bits. With this representation we can answer, among others, the following queries
of interest in this paper.

— Find the label of the element b associated to a given a, S_L[a], in O(1) time.

— Enumerate the k pairs (a, b) ∈ R such that a_1 ≤ a ≤ a_2 and b_1 ≤ b ≤ b_2, in O((k + 1) lg n_2) time.
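
The two queries can be illustrated with a small Python sketch of the simplified relation, where each column has exactly one row. This is only a reference model: it stores S_B and S_L as plain lists and answers the range query by scanning the columns, whereas the representation described above uses a wavelet tree over S_B to reach O((k + 1) lg n_2) time. The class and method names are hypothetical.

```python
class LabeledRelation:
    """Relation with one row per column: S_B[i] is the row, S_L[i] the label."""

    def __init__(self, rows, labels):
        assert len(rows) == len(labels)
        self.S_B = list(rows)    # S_B[i] for columns i = 1..n_1 (stored 0-based)
        self.S_L = list(labels)  # S_L[i], label of the pair in column i

    def label(self, a):
        """Label of the single pair in column a (1-based)."""
        return self.S_L[a - 1]

    def range_report(self, a1, a2, b1, b2):
        """All (a, b, label) with a1 <= a <= a2 and b1 <= b <= b2.

        Linear scan over the column range, for illustration only.
        """
        out = []
        for a in range(a1, a2 + 1):
            b = self.S_B[a - 1]
            if b1 <= b <= b2:
                out.append((a, b, self.S_L[a - 1]))
        return out


rel = LabeledRelation(rows=[3, 1, 2, 3, 1], labels=[10, 20, 30, 40, 50])
```

For instance, rel.range_report(2, 5, 2, 3) inspects columns 2 to 5 and keeps those whose row lies in [2..3].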

2.3 Succinct Tree Representations 

There are many succinct tree representations for trees T with N nodes. Most take 2N + o(N) bits of space.
In this paper we use one called DFUDS [5], which in particular answers in constant time the following
operations. Node identifiers v are associated to a position in [1..2N].

— node(p): the node with preorder number p.

— preorder(v): the preorder number of node v.

— leafrank(v): number of leaves to the left of v.

— numleaves(v): number of leaves below v.

— parent(v): the parent of v.

— child(v, k): the kth child of v.

— nextsibling(v): the next sibling of v.

— degree(v): the number of children of v.

— depth(v): the depth of v.

— level-ancestor(v, k): the kth ancestor of v.

The DFUDS representation is obtained by traversing the tree in DFS order and appending to a bitmap 
the degree of each node, written in unary. 
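
As an illustration of the encoding just described, the sketch below builds the bitmap for a small tree given as a dictionary of child lists. It assumes the usual convention of writing a degree d as d ones followed by a zero, and of prepending one extra 1 so that the sequence is balanced when read as parentheses; the function name, the tree encoding, and the sample tree are hypothetical. The constant-time navigation operations listed above require additional o(N)-bit structures that are not shown.

```python
def dfuds(children, root):
    """DFUDS bit string: degrees in unary (d ones, then a zero), in preorder."""
    bits = ["1"]          # extra leading 1 that balances the sequence
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        bits.append("1" * len(kids) + "0")
        stack.extend(reversed(kids))   # so the leftmost child is visited first
    return "".join(bits)


# Sample tree: r has children a and b; a has children c and d.
tree = {"r": ["a", "b"], "a": ["c", "d"]}
encoding = dfuds(tree, "r")
```

A tree with N nodes yields exactly 2N bits: one zero per node, one 1 per edge, plus the leading 1.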

3 Preprocessing and Representing the Grammar 

Let G be a grammar that generates a single string T[1..u], formed by n (terminal and nonterminal) symbols.
The σ ≤ n terminal symbols come from an alphabet Σ = [1..σ],^2 and then G contains n − σ rules of the
form X_i → α_i, one per nonterminal. This α_i is called the right-hand side of the rule. We call N = ∑|α_i|
the size of G. Note it holds σ ≤ N, as the terminals must appear in the right-hand sides. We assume all
the nonterminals are used to generate the string; otherwise unused rules can be found and dropped in O(N)
time.

^2 Non-contiguous alphabets can be handled with some extra space, as shown in previous work [14].

We preprocess G as follows. First, for each terminal symbol a ∈ Σ present in G we create a rule X_a → a,
and replace all other occurrences of a in the grammar by X_a. As a result, the grammar contains exactly n
nonterminal symbols X = {X_1, ..., X_n}, each associated to a rule X_i → α_i, where α_i ∈ Σ or α_i is a
sequence of symbols of X. We assume that X_n is the start symbol.

Any rule X_i → α_i where |α_i| ≤ 1 (except for X_a → a) is removed by replacing X_i by α_i everywhere,
decreasing n and without increasing N.

We further preprocess G to enforce the property that any nonterminal X_i, except X_n and those X_i →
a ∈ Σ, must be mentioned in at least two right-hand sides. We traverse the rules of the grammar, count
the occurrences of each symbol, and then rewrite the rules, so that only the rules of those X_i appearing
more than once (or the excepted symbols) are rewritten, and as we rewrite a right-hand side, we replace any
(non-excepted) X_i that appears once by its right-hand side α_i. This transformation takes O(N) time and
does not alter N (yet it may reduce n).

Note that n is now the number of rules in the transformed grammar G. We will still call N the size of
the original grammar (the transformed one has size at most N + σ).

We call ℱ(X_i) the single string generated by X_i, that is, ℱ(X_i) = a if X_i → a, and
ℱ(X_i) = ℱ(X_{i_1}) ... ℱ(X_{i_k}) if X_i → X_{i_1} ... X_{i_k}. G generates the text T = L(G) = ℱ(X_n).

Our last preprocessing step, and the most expensive one, is to renumber the nonterminals so that i < j ⇔
ℱ(X_i)^rev < ℱ(X_j)^rev, where S^rev is string S read backwards. The usefulness of this reverse lexicographic
order will be apparent later. The sorting can be done in time O(u + n lg n) and O(u lg u) bits of space [13,
14], which dominates the previous time complexities. Let us say that X_n became X_s after the reordering.
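
The renumbering can be illustrated on a small example. The sketch below uses a hypothetical grammar for the string "alabaralalabarda" (it is not claimed to be the grammar of Figure 1), already in the preprocessed form where each terminal a has a rule X_a → a. It sorts the nonterminals by their reversed expansions in the most naive way, expanding every nonterminal fully; this takes much more time than the O(u + n lg n) procedure cited above and is meant only to show the resulting order.

```python
# Hypothetical preprocessed grammar: a string is a terminal rule X_a -> a,
# a tuple is a right-hand side made of nonterminals.
RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",
    "X": ("A", "L"),                      # al
    "Y": ("X", "A", "B", "A", "R"),       # alabar
    "S": ("Y", "X", "Y", "D", "A"),       # start symbol
}


def expand(rules, x):
    """The string F(x) generated by nonterminal x."""
    rhs = rules[x]
    if isinstance(rhs, str):
        return rhs
    return "".join(expand(rules, y) for y in rhs)


def reverse_lex_order(rules):
    """Nonterminals sorted by their expansions read backwards."""
    return sorted(rules, key=lambda x: expand(rules, x)[::-1])


order = reverse_lex_order(RULES)   # order[i - 1] plays the role of X_i
```

In this example the start symbol ends up second in the order, since its reversed expansion "adrabalalarabala" sorts right after "a".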

We define now a structure that will be key in our index. 

Definition 1. The grammar tree of G is a general tree T_G with nodes labeled in X. Its root is labeled X_s.
Let α_s = X_{s_1} ... X_{s_k}. Then the root has k children labeled X_{s_1}, ..., X_{s_k}. The subtrees of these children are
defined recursively, left to right, so that the first time we find a symbol X_i, we define its children using α_i.
However, the next times we find a symbol X_i, we leave it as a leaf of the grammar tree (if we expanded it the
resulting tree would be the parse tree of T, with u nodes). Also symbols X_a → a are not expanded but left as
leaves. We say that X_i is defined in the only internal node of T_G labeled X_i.

Since each right-hand side α_i ≠ a ∈ Σ is written once in the tree, plus the root X_s, the total number of
nodes in T_G is N + 1.

The grammar tree partitions T in a way that is useful for finding occurrences, using a concept that dates
back to Kärkkäinen [19], who used it for Lempel-Ziv parsings.

Definition 2. Let X_{l_1}, X_{l_2}, ... be the nonterminals labeling the consecutive leaves of T_G. Let T_i = ℱ(X_{l_i}),
then T = T_1 T_2 ... is a partition of T according to the leaves of T_G. An occurrence of pattern P in T is called
primary if it spans more than one T_i, and secondary if it is inside some T_i.
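
Definitions 1 and 2 can be checked on a small example. The following sketch builds the grammar tree in preorder for a hypothetical grammar of "alabaralalabarda" (an assumption for illustration, not necessarily the grammar of Figure 1), verifies that it has N + 1 nodes, and lists the phrases T_i induced by its leaves. Here N is computed as the total length of the right-hand sides of the non-terminal rules, which matches the size of the original grammar before the rules X_a → a were added.

```python
RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",   # rules X_a -> a
    "X": ("A", "L"),
    "Y": ("X", "A", "B", "A", "R"),
    "S": ("Y", "X", "Y", "D", "A"),                      # start symbol
}


def expand(rules, x):
    rhs = rules[x]
    return rhs if isinstance(rhs, str) else "".join(expand(rules, y) for y in rhs)


def grammar_tree(rules, start):
    """Preorder list of (label, is_internal) pairs of the grammar tree."""
    seen, nodes = set(), []

    def visit(x):
        rhs = rules[x]
        internal = x not in seen and not isinstance(rhs, str)
        nodes.append((x, internal))
        if internal:
            seen.add(x)        # later occurrences of x are left as leaves
            for y in rhs:
                visit(y)

    visit(start)
    return nodes


nodes = grammar_tree(RULES, "S")
N = sum(len(rhs) for rhs in RULES.values() if not isinstance(rhs, str))
phrases = [expand(RULES, x) for x, internal in nodes if not internal]
```

In this example the pattern "bar" occurs at positions 4 and 12 of T. The first occurrence spans the phrases b, a, r and is therefore primary; the second lies inside the phrase "alabar" and is secondary.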

Figure 1 shows the reordering and grammar tree for a grammar generating the string "alabaralalabarda".

Our self-index will represent G using two main components. A first one represents the grammar tree T_G
using a DFUDS representation (Section 2.3) and a sequence of labels (Section 2.1). This will be used to
extract text and decompress rules. When augmented with a secondary trie T_S storing leftmost/rightmost
paths in T_G, the representation will expand any prefix/suffix of a rule in optimal time [16].

The second component in our self-index corresponds to a labeled binary relation (Section 2.2), where
B = X and A is the set of proper suffixes starting at positions j + 1 of rules α_i: (α_i[j], (i, j + 1)) will be
related for all X_i → α_i and 1 ≤ j < |α_i|. This binary relation will be used to find the primary occurrences
of the search pattern. Secondary occurrences will be tracked in the grammar tree.

Fig. 1. On top left, a grammar G generating string "alabaralalabarda". On top right, our reordering of the grammar
and strings ℱ(X_i). In the middle, the grammar tree T_G in black; the whole parse tree includes also the grayed part.
On the bottom we show our bitmap L (Section 4.2).



4 Extracting Text 

We first describe a simple structure that extracts the prefix of length ℓ of any rule in O(ℓ + h) time. We
then augment this structure to support extracting any substring of length ℓ in time O(ℓ + h lg(N/h)), and
finally augment it further to retrieve the prefix or suffix of any rule in optimal O(ℓ) time. This last result is
fundamental for supporting searches, and is obtained by extending the structure proposed by Gąsieniec et
al. [16] for SLPs to general context-free grammars generating one string. The improvement does not work for
extracting arbitrary substrings, as in that case one has to find first the nonterminals that must be expanded.
This subproblem is not easy, especially in little space [6].

As anticipated, we represent the topology of the grammar tree T_G using DFUDS [5]. The sequence of
labels associated to the tree nodes is stored in preorder in a sequence X[1..N+1], using the fast representation
of Section 2.1 where we choose constant time for access(X, i) = X[i] and O(lg lg n) time for select_a(X, j).

We also store a bitmap Y[1..n] that marks the rules of the form X_i → a ∈ Σ with a 1-bit. Since the
rules have been renumbered in (reverse) lexicographic order, every time we find a rule X_i such that Y[i] = 1,
we can determine the terminal symbol it represents as a = rank_1(Y, i) in constant time. In our example of
Figure 1 this vector is Y = 101011100.
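
The reason rank_1 on Y recovers the terminal is that single-symbol expansions sort, in reverse lexicographic order, exactly as the terminals themselves. The sketch below checks this on a hypothetical numbering of eight nonterminals (the order and names are assumptions for illustration and differ from Figure 1, whose grammar has nine symbols). The rank is computed by a plain sum rather than a constant-time structure.

```python
# Hypothetical nonterminals X_1..X_8 in reverse lexicographic order of their
# expansions: a, alabaralalabarda, b, d, l, al, r, alabar.
ORDER = ["A", "S", "B", "D", "L", "X", "R", "Y"]
TERMINAL_RULES = {"A": "a", "B": "b", "D": "d", "L": "l", "R": "r"}
SIGMA = sorted(TERMINAL_RULES.values())   # terminal number k is SIGMA[k - 1]

# Y[i] = 1 iff X_i is a rule of the form X_i -> a.
Y = [1 if x in TERMINAL_RULES else 0 for x in ORDER]


def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-based, naive)."""
    return sum(bits[:i])


def terminal_of(i):
    """Terminal represented by X_i, assuming Y[i] = 1."""
    assert Y[i - 1] == 1, "X_i is not a terminal rule"
    return SIGMA[rank1(Y, i) - 1]
```

For example, X_5 is the fourth marked rule, so it represents the fourth terminal of the sorted alphabet, which is l.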

4.1 Expanding Prefixes of Rules 

Expanding a rule X_i that does not correspond to a terminal is done as follows. By the definition of T_G, the
first left-to-right occurrence of X_i in sequence X corresponds to the definition of X_i; all the rest are leaves
in T_G. Therefore, v = node(select_{X_i}(X, 1)) is the node in T_G where X_i is defined. We traverse the subtree
rooted at v in DFS order. Every time we reach a leaf u, we compute its label X_j = X[preorder(u)], and
either output a terminal if Y[j] = 1 or recursively expand X_j. This is in fact a traversal of the parse tree
starting at node v, using instead the grammar tree. Such a traversal takes O(ℓ + h_v) steps [13, 14], where
h_v ≤ h is the height of the parsing subtree rooted at v. In particular, if we extract the whole rule X_i we pay
O(ℓ) steps, since we have removed unary paths in the preprocessing of G and thus v has ℓ > h_v leaves in
the parse tree. The only obstacle to having constant-time steps are the queries select_{X_i}(X, 1). As these are
only for the position 1, we can have them precomputed in a sequence F[1..n] using n⌈lg N⌉ = n lg n + O(N)
further bits of space.
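
The traversal can be sketched as a lazy depth-first expansion that stops as soon as ℓ terminals have been produced. The code below works directly on a hypothetical rule table rather than on the succinct grammar tree, so it models only the order of the steps, not the DFUDS navigation or the F array; the grammar and function names are assumptions for illustration.

```python
from itertools import islice

RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",   # rules X_a -> a
    "X": ("A", "L"),
    "Y": ("X", "A", "B", "A", "R"),
    "S": ("Y", "X", "Y", "D", "A"),
}


def symbols(rules, x):
    """Lazily yield the terminals of the expansion of x, left to right."""
    rhs = rules[x]
    if isinstance(rhs, str):
        yield rhs                      # a leaf with Y[j] = 1: output a terminal
    else:
        for y in rhs:                  # otherwise expand the children in order
            yield from symbols(rules, y)


def prefix(rules, x, length):
    """First `length` terminals of the expansion of x."""
    return "".join(islice(symbols(rules, x), length))
```

Because the generator is consumed lazily, asking for a short prefix only descends the leftmost path and then visits the nodes that actually contribute output, in line with the O(ℓ + h_v) bound.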

The total space required for T_G, considering the DFUDS topology, sequence X, bitmap Y, and sequence
F, is N lg n + n lg n + o(N lg n) bits. We reduce the space to N lg n + δn lg n + o(N lg n), for any 0 < δ < 1,
as follows. Form a sequence X'[1..N − n + 1] where the first position of every symbol X_i in X has been
removed, and mark in a bitmap Z[1..N + 1], with a 1, those first positions in X. Replace our sequence F
by a permutation π[1..n] so that select_{X_i}(X, 1) = F[i] = select_1(Z, π[i]). Now we can still access any
X[i] = X'[rank_0(Z, i)] if Z[i] = 0. For the case Z[i] = 1 we have X[i] = π^{-1}[rank_1(Z, i)]. Similarly,
select_{X_i}(X, j) = select_0(Z, select_{X_i}(X', j − 1)) for j > 1. Then use Z, π, and X' instead of F and X.
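
The identities above can be verified on a toy sequence. The sketch below takes a hypothetical sequence X over [1..3] in which every symbol occurs at least once, derives Z, X' and π, and then answers access and select using only those three structures. All rank and select calls are naive scans, and the simplification of footnote 3 (skipping terminal rules) is ignored.

```python
X = [3, 1, 3, 2, 1, 2, 3]      # hypothetical label sequence over [1..n]
n = 3


def rank(B, b, i):
    """Occurrences of b in B[1..i]."""
    return B[:i].count(b)


def select(B, b, j):
    """Position of the j-th b in B (1-based)."""
    return [p for p, x in enumerate(B, 1) if x == b][j - 1]


first = {}                     # F[i]: first position of symbol i in X
for p, x in enumerate(X, 1):
    first.setdefault(x, p)

Z = [1 if first[x] == p else 0 for p, x in enumerate(X, 1)]
Xp = [x for p, x in enumerate(X, 1) if Z[p - 1] == 0]        # X'
pi = {i: rank(Z, 1, first[i]) for i in range(1, n + 1)}      # F[i] = select_1(Z, pi[i])
pi_inv = {r: i for i, r in pi.items()}


def access(i):
    """X[i], recovered from Z, X' and the inverse permutation."""
    if Z[i - 1] == 0:
        return Xp[rank(Z, 0, i) - 1]
    return pi_inv[rank(Z, 1, i)]


def select_sym(a, j):
    """select_a(X, j), recovered from Z, X' and pi."""
    if j == 1:
        return select(Z, 1, pi[a])
    return select(Z, 0, select(Xp, a, j - 1))
```

Here Z = 1101000 and X' = 3, 1, 2, 3, so for instance the third occurrence of symbol 3 in X is found as the fourth zero of Z, at position 7.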

All the operations retain the same times except for the access to π^{-1}. We use for π a representation by
Munro et al. [27] that takes (1 + δ)n lg n bits and computes any π[i] in constant time and any π^{-1}[j] in
time O(1/δ), which will be the cost to access X. Although this will have an impact later, we note that for
extraction we only access X at leaf nodes, where it always takes constant time.^3

4.2 Extracting Arbitrary Substrings 

In order to extract any given substring of T, we add a bitmap L[1..u + 1] that marks with a 1 the first
position of each T_i in T (see Figure 1). We can then compute the starting position of any node v ∈ T_G as
select_1(L, leafrank(v) + 1).

To extract T[p, p + ℓ − 1], we binary search the starting position p from the root of T_G. If we arrive at a
leaf that does not represent a terminal, we go to its definition in T_G, translate position p to the area below
the new node v, and continue recursively. At some point we finally reach the position p, and from there on
we extract the symbols rightwards. Just as before, the total number of steps is O(ℓ + h). However, the h
steps require binary searches. As there are at most h binary searches among the children of different tree
nodes, and there are N + 1 nodes, in the worst case the binary searches cost O(h lg(N/h)), thus the total
cost is O(ℓ + h lg(N/h)).

The number of ones in L is at most N. Since we only need select_1 on L, we can use an ID representation
(see Section 2.1), requiring N lg(u/N) + O(N + lg lg u) = N lg(u/N) + O(N) bits (since N ≥ lg u in any
grammar). Thus the total space becomes N lg n + N lg(u/N) + δn lg n + o(N lg n) bits.

4.3 Optimal Expansion of Rule Prefixes and Suffixes 

Our improved version builds on the proposal by Gąsieniec et al. [16]. We show how to extend their
representation using succinct data structures so that we can handle general grammars instead of only SLPs.
Following their notation, call S(X_i) the string of labels of the nodes in the path from any node labeled X_i
to its leftmost leaf in the parse tree (we take as leaves the nonterminals X_a ∈ X, not the terminals a ∈ Σ).
We insert all the strings S(X_i)^rev into a trie T_S. Note that each symbol appears only once in T_S [16], thus it
has n nodes. Again, we represent the topology of T_S using DFUDS. However, its sequence of labels X_S[1..n]
turns out to be a permutation in [1..n], for which we use again the representation by Munro et al. [27] that
takes (1 + ε)n lg n bits and computes any X_S[i] in constant time and any X_S^{-1}[j] in time O(1/ε).

We can determine the first terminal in the expansion of X_i, which labels node v ∈ T_S, as follows. Since
the last symbol in S(X_i) is a nonterminal X_a representing some a ∈ Σ, it follows that X_i descends in
T_S from X_a, which is a child of the root. This node is v_a = level-ancestor(v, depth(v) − 1). Then a =
rank_1(Y, X_S[preorder(v_a)]). Figure 2 shows an example of this particular query in the trie for the grammar
presented in Figure 1.
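
The structure of the trie can be made explicit with a short sketch: since S(X_i) is X_i followed by S of the first symbol of α_i, the parent of X_i in T_S is simply that first symbol, and the rules X_a → a are the children of the root. The code below uses a hypothetical grammar for "alabaralalabarda" and walks up parent by parent, instead of using the constant-time level-ancestor query of the succinct representation.

```python
RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",   # rules X_a -> a
    "X": ("A", "L"),
    "Y": ("X", "A", "B", "A", "R"),
    "S": ("Y", "X", "Y", "D", "A"),
}

# Parent of each symbol in the trie of reversed leftmost paths
# (None marks the children of the trie root).
TRIE_PARENT = {x: (None if isinstance(rhs, str) else rhs[0])
               for x, rhs in RULES.items()}


def leftmost_path(x):
    """S(x): labels from x down to its leftmost leaf X_a in the parse tree."""
    path = [x]
    while TRIE_PARENT[path[-1]] is not None:
        path.append(TRIE_PARENT[path[-1]])
    return path


def first_terminal(x):
    """First terminal of the expansion of x: the depth-1 ancestor in the trie."""
    return RULES[leftmost_path(x)[-1]]
```

Every symbol has exactly one parent entry, which is why the trie has n nodes regardless of how many times a symbol appears in the parse tree.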

^3 Nonterminals X_a → a do not have a definition in T_G, so they are not extracted from X nor represented in π, thus
they are accessed in constant time. They can be skipped from π[1..n] with bitmap Y, so that in fact π is of length
n − σ and is accessed as π[rank_0(Y, i)]; for π^{-1} we actually use select_0(Y, π^{-1}[j]).

Fig. 2. Example of the trie of leftmost paths for the grammar of Figure 1. The arrow pointing from X_2 to X_1 illustrates
the procedure to determine the first terminal symbol generated by X_2.



A prefix of X_i is extracted as follows. First, we obtain the corresponding node v ∈ T_S as v = node(X_S^{-1}[i]).
Then we obtain the leftmost symbol of v as explained. The remaining symbols descend from the second and
following children, in the parse tree, of the nodes in the upward path from a node labeled X_i to its leftmost
leaf, or which is the same, of the nodes in the downward path from the root of T_S to v. Therefore, for
each node w in the list level-ancestor(v, depth(v) − 2), ..., parent(v), v, we map w to its definition u ∈ T_G,
u = node(select_{X_j}(X, 1)) where X_j = X_S[preorder(w)]. Once u is found, we recursively expand its children,
from the second to the last, by mapping them back to T_S, and so on. By charging the cost to the new symbol
to expand, and because there are no unary paths, it can be seen that we carry out O(ℓ) steps to extract
the first ℓ symbols. Moreover, the extraction is real-time [16]. All costs per step are constant except for the
O(1/ε) to access X_S^{-1}.

For extracting suffixes of rules in G, we need another version of T_S that stores the rightmost paths. This
leads to our first result (choosing δ = o(1)).

Lemma 1. Let a sequence T[1..u] be represented by a context-free grammar with n symbols, size N, and
height h. Then, for any 0 < ε < 1, there exists a data structure using at most N lg n + N lg(u/N) +
(2 + ε)n lg n + o(N lg n) bits of space that extracts any substring of length ℓ from T in time O(ℓ + h lg(N/h)), and
a prefix or suffix of length ℓ of the expansion of any nonterminal in time O(ℓ/ε).

5 Locating Patterns 

A secondary occurrence of P inside a leaf of T_G labeled by a symbol X_i occurs as well in the internal node of
T_G where X_i is defined. If that occurrence is also secondary, then it occurs inside a child X_j of X_i, and we
can repeat the argument with X_j until finding a primary occurrence inside some X_k. This shows that all the
secondary occurrences can be found by first spotting the primary occurrences, and then finding all the copies
of the nonterminal X_k that contain the primary occurrences, as well as all the copies of the nonterminals
that contain X_k, recursively.

The strategy [19] to find the primary occurrences of P = p_1 p_2 ... p_m is to consider the m − 1 partitions
P = P_1 · P_2, P_1 = p_1 ... p_i and P_2 = p_{i+1} ... p_m, for 1 ≤ i < m. For each partition we will find all
the nonterminals X_k → X_{k_1} X_{k_2} ... X_{k_r} such that P_1 is a suffix of some ℱ(X_{k_j}) and P_2 is a prefix of
ℱ(X_{k_{j+1}}) ... ℱ(X_{k_r}). This finds each primary occurrence exactly once. The secondary occurrences are then
tracked in the grammar tree T_G.^4
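
The partition strategy can be stated as a brute-force procedure, which is useful as a specification of what the index must return. The sketch below tries every rule, every cut between two consecutive symbols of its right-hand side, and every partition of P, and reports the triples (rule, j, i) that match. It uses a hypothetical grammar for "alabaralalabarda" and expands strings explicitly, so it has none of the efficiency of the index; all names are assumptions for illustration.

```python
RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",   # rules X_a -> a
    "X": ("A", "L"),
    "Y": ("X", "A", "B", "A", "R"),
    "S": ("Y", "X", "Y", "D", "A"),
}


def expand(rules, x):
    rhs = rules[x]
    return rhs if isinstance(rhs, str) else "".join(expand(rules, y) for y in rhs)


def primary_occurrences(rules, P):
    """Triples (rule, j, i): P[:i] is a suffix of the j-th symbol's expansion
    and P[i:] is a prefix of the expansion of the symbols after it."""
    hits = []
    for x, rhs in rules.items():
        if isinstance(rhs, str):
            continue
        exp = [expand(rules, y) for y in rhs]
        for j in range(1, len(rhs)):           # cut after the j-th symbol
            rest = "".join(exp[j:])
            for i in range(1, len(P)):         # P_1 = P[:i], P_2 = P[i:]
                if exp[j - 1].endswith(P[:i]) and rest.startswith(P[i:]):
                    hits.append((x, j, i))
    return hits
```

For P = "bar" the only hit is in the rule for "alabar", with the cut after its third symbol and P_1 = "b", although "bar" occurs twice in the text; the other occurrence is secondary. For P = "ala" there are two primary occurrences and one secondary.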

5.1 Finding Primary Occurrences 

As anticipated at the end of Section 3, we store a binary relation R ⊆ A × B to find the primary occurrences.
It has n rows labeled X_i, for all X_i ∈ X = B, and N − n columns. Each column, denoted (i, j + 1), corresponds
to a distinct proper suffix α_i[j + 1..] of a right-hand side α_i. The labels belong to [1..N + 1]. The relation
contains one pair per column: (α_i[j], (i, j + 1)) ∈ R for all 1 ≤ i ≤ n and 1 ≤ j < |α_i|. Its label is the
preorder of the (j + 1)th child of the node v ∈ T_G where X_i is defined. The space for the binary relation is
(N − n)(lg n + lg N) + O(N) bits.

^4 If m = 1 we can just find all the occurrences of X_{p_1} in T_G and track its secondary occurrences.

Recall that, in our preprocessing, we have sorted X according to the lexicographic order of ℱ(X_i)^rev. We
also sort all the pairs (i, j + 1) lexicographically according to the suffixes ℱ(α_i[j + 1]) ℱ(α_i[j + 2]) ... ℱ(α_i[|α_i|]).
This can be done in O(u + N lg N) time in a way similar to how X was sorted: Each pair (i, j + 1), labeled p, can
be associated to the substring T[select_1(L, leafrank(node(p)) + 1) .. select_1(L, leafrank(v) + numleaves(v) +
1) − 1], where v is the parent of node(p). Then we can proceed as in previous work [13, 14]. Figure 3 illustrates
how R is used, for the grammar presented in Figure 1.

Given P_1 and P_2, we first find the range of rows whose expansions finish with P_1, by binary searching for
P_1^rev in the expansions ℱ(X_i)^rev. Each comparison in the binary search needs to extract |P_1| terminals from
the suffix of ℱ(X_i). According to Lemma 1, this takes O(|P_1|/ε) time. Similarly, we binary search for the
range of columns whose expansions start with P_2. Each comparison needs to extract ℓ = |P_2| terminals from
the prefix of ℱ(α_i[j + 1]) ℱ(α_i[j + 2]) .... Let r be the column we wish to compare to P_2. We extract the label
p associated to the column in constant time (recall Section 2.2). Then we extract the first ℓ symbols from the
expansion of node(p) ∈ T_G. If node(p) does not have enough symbols, we continue with nextsibling(node(p)), and
so on, until we extract ℓ symbols or we exhaust the suffix of the rule. According to Lemma 1, this requires
time O(|P_2|/ε). Thus our two binary searches require time O((m/ε) lg N).

This time can be further improved by using the same technique as in previous work [14]. The idea is to
sample phrases at regular intervals and store the sampled phrases in a Patricia tree [26]. We first search for the
pattern in the Patricia tree, and then complete the process with a binary search between two sampled phrases
(we first verify the correctness of the Patricia search by checking that our pattern is actually within the range
found). By sampling every lg u lg lg n / lg n phrases, the resulting time for searching becomes
O((m/ε) lg((lg u)/(lg n))) and we only require o(N lg n) bits of extra space, as the Patricia tree needs O(lg u) bits per node.

Once we have identified a range of rows [a_1, a_2] and a range of columns [b_1, b_2], we retrieve all the points
in the rectangle and their labels, each in time O(lg n), according to Section 2.2. The parents of all the nodes
node(p) ∈ T_G, for each point p in the range, correspond to the primary occurrences. In Section 5.2 we show
how to report primary and secondary occurrences starting directly from those node(p) positions.

Recall that we have to carry out this search for m − 1 partitions of P, whereas each primary occurrence
is found exactly once. Calling occ the number of primary occurrences, the total cost of this part of the search
is O((m^2/ε) lg((lg u)/(lg n)) + occ lg n).

Fig. 3. Relation R for the grammar presented in Figure 1. The highlighted ranges correspond to the result of searching
for b · ar, where the single primary occurrence corresponds to X_2.

5.2 Tracking Occurrences Through the Grammar Tree

The remaining problem is how to track all the secondary occurrences triggered by a primary occurrence, and
how to report the positions where these occur in T. Given a primary occurrence for partition P = P_1 · P_2
located at u = node(p) ∈ T_G, we obtain the starting position of P in T by moving towards the root while
keeping count of the offset between the beginning of the current node and the occurrence of P. Initially, for
node u itself, this is l = −|P_1|. Now, as long as u is not the root, we set l ← l + select_1(L, leafrank(u) + 1) −
select_1(L, leafrank(parent(u)) + 1) and then u ← parent(u). When we arrive at the root, the occurrence
of P starts at l.

It seems like we are doing this h times in the worst case, since we need to track the occurrence up to
the root. In fact we might do so for some symbols, but the total cost is amortized. Every time we move
from u to v = parent(u), we know that X[v] appears at least once more in the tree. This is because of our
preprocessing (Section 3), where we force rules to appear at least twice or be removed. Thus v defines X[v],
but there are one or more leaves labeled X[v], and we have to report the occurrences of P inside them all.
For this sake we carry out select_{X[v]}(X, i) for i = 1, 2, ... until spotting all those occurrences (where P occurs
with the current offset l). We recursively track them to the root of T_G to find their absolute position in T,
and recursively find the other occurrences of all their ancestor nodes. The overall cost amortizes to O(1)
steps per occurrence reported, as we can charge the cost of moving from u to v to the other occurrence of
v. If we report occ secondary occurrences we carry out O(occ) steps, each costing O(lg lg n) time. We can
thus use δ = O(1/lg lg n) (Section 4.1) so that the cost to access X[v] does not impact the space nor time
complexity.
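
The tracking procedure can be sketched as a short recursion over an explicit grammar tree. The code below is a simplified model under several assumptions: the grammar is a hypothetical one for "alabaralalabarda"; node starting positions are stored in a plain list instead of being computed with select_1 on L; the copies of a nonterminal are found by scanning the label list instead of using select on X; and the offset is measured from the beginning of the node where the primary occurrence is defined (the parent of node(p)), which is the value of l after the first upward step in the text.

```python
RULES = {
    "A": "a", "L": "l", "B": "b", "R": "r", "D": "d",   # rules X_a -> a
    "X": ("A", "L"),
    "Y": ("X", "A", "B", "A", "R"),
    "S": ("Y", "X", "Y", "D", "A"),
}


def expand(rules, x):
    rhs = rules[x]
    return rhs if isinstance(rhs, str) else "".join(expand(rules, y) for y in rhs)


def build_tree(rules, start):
    """Grammar tree in preorder: labels, parents, 0-based text starts, internal flags."""
    label, parent, start_pos, internal = [], [], [], []
    seen = set()

    def visit(x, par, pos):
        v = len(label)
        first = x not in seen and not isinstance(rules[x], str)
        label.append(x); parent.append(par)
        start_pos.append(pos); internal.append(first)
        if first:
            seen.add(x)
            for y in rules[x]:
                visit(y, v, pos)
                pos += len(expand(rules, y))

    visit(start, None, 0)
    return label, parent, start_pos, internal


def track(tree, v, off, out):
    """Report (1-based) every text position of an occurrence at offset off in node v."""
    label, parent, start_pos, internal = tree
    if parent[v] is None:
        out.append(off + 1)
        return
    u = parent[v]
    track(tree, u, off + start_pos[v] - start_pos[u], out)
    if internal[v]:                       # v defines label[v]: visit its leaf copies
        for w, x in enumerate(label):
            if x == label[v] and not internal[w]:
                track(tree, w, off, out)


TREE = build_tree(RULES, "S")
y_def = TREE[0].index("Y")                # first occurrence in preorder is the definition
bar = []
track(TREE, y_def, 3, bar)                # "bar" starts at offset 3 inside "alabar"
```

In this example the single primary occurrence of "bar" yields both of its text positions, 4 and 12, the second one through the leaf copy of the nonterminal that generates "alabar".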

By adding up the space of Lemma 1 with that of the labeled binary relation, and adding up the costs,
we have our central result, Theorem 1.

6 Conclusions 

We presented the first grammar-based text index whose locate time does not depend on the height of the
grammar. There are previous results on generating balanced grammars to compress text, as for example the
ones proposed by Rytter [33] and Sakamoto [34]. These representations allow previous indexing techniques
to guarantee sublinear locating times, yet these techniques introduce a penalty in the size of the grammar.
Our index also extends the grammar-based indexing techniques to a more general class of grammars than
SLPs, the only class explored so far in this scenario.

We note that in our index each primary occurrence is reported in O(lg n) time, whereas each secondary
one requires just O(lg lg n) time. The complexity of primary occurrences is dominated by the time to report
points in a range using our binary relation representation. We believe this can be lowered to O(lg lg n),
in exchange for using more space. For example, Bose et al. [7] represent an n × n grid with n points within
n lg n + o(n lg n) bits, so that each point in a range can be reported in time O(lg n / lg lg n); using O((1/ε)n lg n)
bits the time can be reduced to O(lg^ε n) for any constant 0 < ε < 1 [10,29,8]; and using O(n lg n lg lg n) bits
one can achieve time O(lg lg n) [1,8] (all these solutions have a small additive time that is not relevant for
our application). It seems likely that these structures can be extended to represent our n × N grid with N
points (i.e., our string S_B). In the case of Bose et al. this could even be asymptotically free in terms of space.

Alternatively, instead of speeding up the reporting of primary occurrences, we can slow down that of
secondary occurrences so that they match, and in exchange reduce the space. For example, one of our largest
terms in the index space owes to the need of storing the phrase lengths in T_G. By storing just the n internal
node lengths and one out of lg n lengths at the leaves of T_G, we reduce N lg(u/N) bits of space in our index
to (n + (N − n)/lg n) lg(u/(n + (N − n)/lg n)) ≤ (n + N/lg n) lg(u/N) + o(N lg n). Note this penalizes the
extraction time by an O(lg n) factor in the worst case.

Several questions remain open, for example: Is it possible to lower the dependence on m to linear, as
achieved in some LZ78-based schemes [32]? Is it possible to reduce the space to N lg n + o(N lg n), that is,
asymptotically the same as the compressed text, as achieved on statistical-compression-based self-indexes
[28]? Is it possible to remove h from the extraction complexity within less space than the current solutions
[6]?